The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user
# I will use the latest version of the library as that will best prepare me for a job in the data science field.
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
# !pip install --upgrade -q jinja2
# Reading, manipulating, and visualizing data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Scaling Data
from sklearn.preprocessing import StandardScaler
# Distances between clusters, k-means, and silhouette
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Hierarchical clustering, cophenetic correlation, and dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# mount my google drive
from google.colab import drive
drive.mount('/content/drive')
# read the csv file to a data frame
df=pd.read_csv('/content/drive/MyDrive/Python Foundations - GreatLearning/stock_data.csv')
data=df.copy()
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Get an idea what the data looks like
data.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
data.tail()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 939457327.6 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.4 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 188461538.5 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 257892500.0 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 498529411.8 | 70.470585 | 1.723068 |
# see how many rows and columns there are
data.shape
(340, 15)
# figure out the data types of each column
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
# check if any rows are repeated
data.duplicated().sum()
0
# check what the percentage breakdown of industries
data['GICS Sector'].value_counts(normalize=True)
| proportion | |
|---|---|
| GICS Sector | |
| Industrials | 0.155882 |
| Financials | 0.144118 |
| Health Care | 0.117647 |
| Consumer Discretionary | 0.117647 |
| Information Technology | 0.097059 |
| Energy | 0.088235 |
| Real Estate | 0.079412 |
| Utilities | 0.070588 |
| Materials | 0.058824 |
| Consumer Staples | 0.055882 |
| Telecommunications Services | 0.014706 |
We can see that some of the industry type are more represented than others. This may or may not be reflected in the clusters later.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# create a subset of the data frame that only include columns with numeric values
data_numeric = data.select_dtypes(include=[np.number])
# create a loop that creates a hisogram and boxplot for each numeric column
for feature in data_numeric.columns:
histogram_boxplot(data_numeric, feature, figsize=(12, 7), kde=False, bins=None)
It seems that nearly every column in this data set has outliers, many of which are extreme outliers. Because we are only trying to group the data into similar clusters and not predict a target variable, this should be fine. If data imputation was needed, median would be used as the outliers could skew the data. That being said, this data set has no missing values.
Questions:
What does the distribution of stock prices look like?
# use the histogram_boxplot function to show current price distribution
histogram_boxplot(data, 'Current Price', figsize=(12, 7), kde=True, bins=None)
We can see the data is extremely right skewed indicating some stock prices with much higher values. This does not mean those companies are more valuable as they may simply have fewer stocks at a higher price.
The stocks of which economic sector have seen the maximum price increase on average?
# calculate the mean price increase by gcis sector
data.groupby('GICS Sector')['Price Change'].mean().sort_values(ascending=False)
| Price Change | |
|---|---|
| GICS Sector | |
| Health Care | 9.585652 |
| Consumer Staples | 8.684750 |
| Information Technology | 7.217476 |
| Telecommunications Services | 6.956980 |
| Real Estate | 6.205548 |
| Consumer Discretionary | 5.846093 |
| Materials | 5.589738 |
| Financials | 3.865406 |
| Industrials | 2.833127 |
| Utilities | 0.803657 |
| Energy | -10.228289 |
We can see the Health Care sector has done particularly well and the Energy sector has done particularly poorly.
How are the different variables correlated with each other?
# make a heatmap that shows how each column in the dataframe correlates with each other column in the dataframe
sns.heatmap(data_numeric.corr(), annot=True, cmap="coolwarm")
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()
We can see that the correlations are relatively low overall. All correlations have a magnitude less than 60% correlated. The two most correlated variables are Net Income and Estimated Shares Outstanding.
Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
# show the mean cash ratio grouped by gcis sector
data.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
| Cash Ratio | |
|---|---|
| GICS Sector | |
| Information Technology | 149.818182 |
| Telecommunications Services | 117.000000 |
| Health Care | 103.775000 |
| Financials | 98.591837 |
| Consumer Staples | 70.947368 |
| Energy | 51.133333 |
| Real Estate | 50.111111 |
| Consumer Discretionary | 49.575000 |
| Materials | 41.700000 |
| Industrials | 36.188679 |
| Utilities | 13.625000 |
We can see that IT has more cash on hand to cover its short-term obligations while Utilities has much less cash on hand than the other industry types.
P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
# show mean P/E ratios grouped by gcis sector
data.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
| P/E Ratio | |
|---|---|
| GICS Sector | |
| Energy | 72.897709 |
| Information Technology | 43.782546 |
| Real Estate | 43.065585 |
| Health Care | 41.135272 |
| Consumer Discretionary | 35.211613 |
| Consumer Staples | 25.521195 |
| Materials | 24.585352 |
| Utilities | 18.719412 |
| Industrials | 18.259380 |
| Financials | 16.023151 |
| Telecommunications Services | 12.222578 |
This is interesting because we can see that despite having the largest average negative price change (see EDA Question 1), Energy has the highest P/E ratio meaning people are willing to purchase the shares at a higher average price relative to dollars earned than any other sector. This speaks highly to stockholder's confidence in the energy sector.
# duplicate value check
data.duplicated().sum()
0
# missing value treatment
# check for missing values
data.isna().sum()
| 0 | |
|---|---|
| Ticker Symbol | 0 |
| Security | 0 |
| GICS Sector | 0 |
| GICS Sub Industry | 0 |
| Current Price | 0 |
| Price Change | 0 |
| Volatility | 0 |
| ROE | 0 |
| Cash Ratio | 0 |
| Net Cash Flow | 0 |
| Net Income | 0 |
| Earnings Per Share | 0 |
| Estimated Shares Outstanding | 0 |
| P/E Ratio | 0 |
| P/B Ratio | 0 |
We observed during the EDA that there were outliers for almost every numeric column, some of which being quite extreme. There aren't many rows in our dataframe, only 340. While these outliers could affect the shape of our clusters, removing them could have a significant impact as well and we have no reason to believe the outlier data is false in any way. We will leave the outliers in.
# scaling should be done before encoding dummy variables. We don't really care about the variance or standard deviation of binary variables meant for tracking categories
# Scale the numeric columns in the original data frame
scale_data = data.copy()
scaler = StandardScaler()
scale_data[data_numeric.columns] = scaler.fit_transform(scale_data[data_numeric.columns])
scale_data.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
It appears that our data has been correctly scaled. Next, we need to remove the Ticker Symbol column and the Security column. These columns essentially work as names and do not add any information to our clusters.
# remove Ticker Symbol and Security columns from the data frame
scale_data.drop(['Ticker Symbol', 'Security'], axis=1, inplace=True)
scale_data.head()
| GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Industrials | Airlines | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | Health Care | Pharmaceuticals | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | Health Care | Health Care Equipment | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | Information Technology | Application Software | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | Information Technology | Semiconductors | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
Because there are SO MANY unique values in GICS Sector and GICS Sub Industry, I am going to move forward with three new data sets as described below:
data1 is a copy of the data frame with both GICS Sector and GICS Sub Industry dummy encoded. The concern here will be curse of dimensionality as we are increasing the number of columns from 13 to 124 even with drop_first=True
data2 is a copy of the data frame that drops both GICS Sector and GICS Sub Industry. This data frame may lose too much information to be useful but will not suffer from the curse of dimensionality at all
data1=scale_data.copy()
data2=scale_data.copy()
# for data1, encode dummy variables for GICS Sector and GICS Sub Industry
data1 = pd.get_dummies(data1, columns=['GICS Sector', 'GICS Sub Industry'], drop_first=True)
# for each column with data type bool in data1, map "True" to "1" and map "False" to "0" and change type to int
for col in data1.columns:
if data1[col].dtype == 'bool':
data1[col] = data1[col].map({True: 1, False: 0})
data1[col] = data1[col].astype(int)
# for data2, drop GICS Sector and GICS Sub Industry
data2.drop(['GICS Sector', 'GICS Sub Industry'], axis=1, inplace=True)
data1.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector_Consumer Staples | GICS Sector_Energy | GICS Sector_Financials | GICS Sector_Health Care | GICS Sector_Industrials | GICS Sector_Information Technology | GICS Sector_Materials | GICS Sector_Real Estate | GICS Sector_Telecommunications Services | GICS Sector_Utilities | GICS Sub Industry_Aerospace & Defense | GICS Sub Industry_Agricultural Products | GICS Sub Industry_Air Freight & Logistics | GICS Sub Industry_Airlines | GICS Sub Industry_Alternative Carriers | GICS Sub Industry_Apparel, Accessories & Luxury Goods | GICS Sub Industry_Application Software | GICS Sub Industry_Asset Management & Custody Banks | GICS Sub Industry_Auto Parts & Equipment | GICS Sub Industry_Automobile Manufacturers | GICS Sub Industry_Banks | GICS Sub Industry_Biotechnology | GICS Sub Industry_Brewers | GICS Sub Industry_Broadcasting & Cable TV | GICS Sub Industry_Building Products | GICS Sub Industry_Cable & Satellite | GICS Sub Industry_Casinos & Gaming | GICS Sub Industry_Computer Hardware | GICS Sub Industry_Construction & Farm Machinery & Heavy Trucks | GICS Sub Industry_Construction Materials | GICS Sub Industry_Consumer Electronics | GICS Sub Industry_Consumer Finance | GICS Sub Industry_Copper | GICS Sub Industry_Data Processing & Outsourced Services | GICS Sub Industry_Distributors | GICS Sub Industry_Diversified Chemicals | GICS Sub Industry_Diversified Commercial Services | GICS Sub Industry_Diversified Financial Services | GICS Sub Industry_Drug Retail | GICS Sub Industry_Electric Utilities | GICS Sub Industry_Electrical Components & Equipment | GICS Sub Industry_Electronic Components | GICS Sub Industry_Electronic Equipment & Instruments | GICS Sub Industry_Environmental Services | GICS Sub Industry_Fertilizers & Agricultural Chemicals | GICS Sub Industry_Financial Exchanges & Data | GICS Sub Industry_Gold | GICS Sub Industry_Health Care Distributors | GICS Sub Industry_Health Care Equipment | GICS Sub Industry_Health Care Facilities | GICS Sub Industry_Health Care Supplies | GICS Sub Industry_Home Entertainment Software | GICS Sub Industry_Home Furnishings | GICS Sub Industry_Homebuilding | GICS Sub Industry_Hotels, Resorts & Cruise Lines | GICS Sub Industry_Household Appliances | GICS Sub Industry_Household Products | GICS Sub Industry_Housewares & Specialties | GICS Sub Industry_Human Resource & Employment Services | GICS Sub Industry_IT Consulting & Other Services | GICS Sub Industry_Industrial Conglomerates | GICS Sub Industry_Industrial Gases | GICS Sub Industry_Industrial Machinery | GICS Sub Industry_Industrial Materials | GICS Sub Industry_Insurance Brokers | GICS Sub Industry_Integrated Oil & Gas | GICS Sub Industry_Integrated Telecommunications Services | GICS Sub Industry_Internet & Direct Marketing Retail | GICS Sub Industry_Internet Software & Services | GICS Sub Industry_Investment Banking & Brokerage | GICS Sub Industry_Leisure Products | GICS Sub Industry_Life & Health Insurance | GICS Sub Industry_Life Sciences Tools & Services | GICS Sub Industry_Managed Health Care | GICS Sub Industry_Metal & Glass Containers | GICS Sub Industry_Motorcycle Manufacturers | GICS Sub Industry_Multi-Sector Holdings | GICS Sub Industry_Multi-line Insurance | GICS Sub Industry_MultiUtilities | GICS Sub Industry_Networking Equipment | GICS Sub Industry_Office REITs | GICS Sub Industry_Oil & Gas Equipment & Services | GICS Sub Industry_Oil & Gas Exploration & Production | GICS Sub Industry_Oil & Gas Refining & Marketing & Transportation | GICS Sub Industry_Packaged Foods & Meats | GICS Sub Industry_Paper Packaging | GICS Sub Industry_Personal Products | GICS Sub Industry_Pharmaceuticals | GICS Sub Industry_Property & Casualty Insurance | GICS Sub Industry_Publishing | GICS Sub Industry_REITs | GICS Sub Industry_Railroads | GICS Sub Industry_Real Estate Services | GICS Sub Industry_Regional Banks | GICS Sub Industry_Research & Consulting Services | GICS Sub Industry_Residential REITs | GICS Sub Industry_Restaurants | GICS Sub Industry_Retail REITs | GICS Sub Industry_Semiconductor Equipment | GICS Sub Industry_Semiconductors | GICS Sub Industry_Soft Drinks | GICS Sub Industry_Specialized REITs | GICS Sub Industry_Specialty Chemicals | GICS Sub Industry_Specialty Retail | GICS Sub Industry_Specialty Stores | GICS Sub Industry_Steel | GICS Sub Industry_Technology Hardware, Storage & Peripherals | GICS Sub Industry_Technology, Hardware, Software and Supplies | GICS Sub Industry_Thrifts & Mortgage Finance | GICS Sub Industry_Tires & Rubber | GICS Sub Industry_Tobacco | GICS Sub Industry_Trucking | GICS Sub Industry_Water Utilities | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Because the variance and standard deviation of columns were maintained after scaling and no outliers were removed, all the graphs would look identical except that the scale of the units would be less readable. Secondary EDA is not necessary in this case.
As a reminder: data1 is a copy of the data frame with both GICS Sector and GICS Sub Industry dummy encoded. The concern here will be curse of dimensionality as we are increasing the number of columns from 13 to 124 even with drop_first=True
# make a copy of data1 so we don't change it so we can use data1 for hierarchical clustering later
k_means_df1 = data1.copy()
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(data1)
prediction = model.predict(k_means_df1)
distortion = (
sum(np.min(cdist(k_means_df1, model.cluster_centers_, "euclidean"), axis=1))
/ k_means_df1.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 Average Distortion: 2.964689739825615 Number of Clusters: 2 Average Distortion: 2.8683405676145317 Number of Clusters: 3 Average Distortion: 2.803663383067996 Number of Clusters: 4 Average Distortion: 2.6549208303886864 Number of Clusters: 5 Average Distortion: 2.5709688021715866 Number of Clusters: 6 Average Distortion: 2.5215253702526255 Number of Clusters: 7 Average Distortion: 2.4627896099777917 Number of Clusters: 8 Average Distortion: 2.43557403792719 Number of Clusters: 9 Average Distortion: 2.3905822333386255 Number of Clusters: 10 Average Distortion: 2.3356012840014118 Number of Clusters: 11 Average Distortion: 2.2955311559289453 Number of Clusters: 12 Average Distortion: 2.253550018031723 Number of Clusters: 13 Average Distortion: 2.2073357384639367 Number of Clusters: 14 Average Distortion: 2.1815365543372605
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df1) # fit the data to the visualizer
visualizer.show() # finalize and render figure
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
# silhouette scores
silhouette_scores = []
cluster_list = range(2,15)
for k in range(2, 15):
kmeans = KMeans(n_clusters=k, random_state=1).fit(k_means_df1)
predictions = kmeans.fit_predict(k_means_df1)
score = silhouette_score(k_means_df1, predictions)
silhouette_scores.append(score)
print("Number of Clusters:", k, "\tAverage Silhouette Score:", score)
plt.plot(cluster_list, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Average Silhouette Score")
plt.title("Selecting k with the Silhouette Method", fontsize=20)
plt.show()
Number of Clusters: 2 Average Silhouette Score: 0.3077609091830765 Number of Clusters: 3 Average Silhouette Score: 0.3183211990235526 Number of Clusters: 4 Average Silhouette Score: 0.34517209690198636 Number of Clusters: 5 Average Silhouette Score: 0.317171879589086 Number of Clusters: 6 Average Silhouette Score: 0.321540601063231 Number of Clusters: 7 Average Silhouette Score: 0.33899989921610313 Number of Clusters: 8 Average Silhouette Score: 0.30706456773522595 Number of Clusters: 9 Average Silhouette Score: 0.31159194506849025 Number of Clusters: 10 Average Silhouette Score: 0.08598883424124792 Number of Clusters: 11 Average Silhouette Score: 0.08107759543523096 Number of Clusters: 12 Average Silhouette Score: 0.08805038973278116 Number of Clusters: 13 Average Silhouette Score: 0.09375702541346734 Number of Clusters: 14 Average Silhouette Score: 0.09578160880917934
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2,15), metric="silhouette", timings=True)
visualizer.fit(k_means_df1) # fit the data to the visualizer
visualizer.show() # finalize and render figure
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=4, random_state=1))
visualizer.fit(k_means_df1)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6, random_state=1))
visualizer.fit(k_means_df1)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
The case for four and six clusters are both very strong. I think for both cases, overall the grouping for the clusters is weak. The average silhouette score is below 0.35 for both options. For 4 and 6 clusters, both cases have a cluster with a silhouette score that is more negative than positive. I think we care slightly more about clustering quality than in-cluster variance so I am inclined to lean towards the 4-cluster grouping.
# final kmeans model for data1
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(k_means_df1)
KMeans(n_clusters=4, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=4, random_state=1)
# create a copy from original copy
dfkm1 = data.copy() # name comes from data frame kmeans on data 1
# adding kmeans cluster labels to the copy of the original copy and to the scaled data frame data1
dfkm1['Cluster'] = kmeans.labels_
k_means_df1['Cluster'] = kmeans.labels_
# cluster profiling
km_cluster_profile_1 = dfkm1.groupby('Cluster').mean(numeric_only=True)
km_cluster_profile_1.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | |||||||||||
| 0 | 75.355204 | -7.350601 | 2.540224 | 35.975610 | 148.268293 | 2.124745e+08 | -1.161346e+09 | -2.648293 | 5.918493e+08 | 92.900103 | 2.609334 |
| 1 | 83.392891 | 5.848839 | 1.376700 | 34.515789 | 58.673684 | 8.323276e+07 | 1.562137e+09 | 3.951316 | 4.322491e+08 | 24.217338 | -2.266772 |
| 2 | 26.990000 | -14.060688 | 3.296307 | 603.000000 | 57.333333 | -5.850000e+08 | -1.755567e+10 | -39.726667 | 4.819101e+08 | 71.528835 | 1.638633 |
| 3 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1.072273e+09 | 1.483309e+10 | 4.154545 | 4.298827e+09 | 14.803577 | -4.552119 |
km_cluster_profile_1["Count of Companies in this Cluster"] = (dfkm1.groupby('Cluster')["Security"].count().values)
km_cluster_profile_1.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count of Companies in this Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||||
| 0 | 75.355204 | -7.350601 | 2.540224 | 35.975610 | 148.268293 | 2.124745e+08 | -1.161346e+09 | -2.648293 | 5.918493e+08 | 92.900103 | 2.609334 | 41 |
| 1 | 83.392891 | 5.848839 | 1.376700 | 34.515789 | 58.673684 | 8.323276e+07 | 1.562137e+09 | 3.951316 | 4.322491e+08 | 24.217338 | -2.266772 | 285 |
| 2 | 26.990000 | -14.060688 | 3.296307 | 603.000000 | 57.333333 | -5.850000e+08 | -1.755567e+10 | -39.726667 | 4.819101e+08 | 71.528835 | 1.638633 | 3 |
| 3 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1.072273e+09 | 1.483309e+10 | 4.154545 | 4.298827e+09 | 14.803577 | -4.552119 | 11 |
# print which companies are in each cluster
for cluster in dfkm1['Cluster'].unique():
print(f'Cluster {cluster}')
print(dfkm1[dfkm1['Cluster'] == cluster]['Security'].unique())
print('')
Cluster 1 ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Alliance Data Systems' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Amgen Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Bank of America Corp' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equinix' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics' 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Priceline.com Inc' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Waters Corporation' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] Cluster 0 ['Analog Devices, Inc.' 'Akamai Technologies Inc' 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Cabot Oil & Gas' 'Concho Resources' 'EOG Resources' 'EQT Corporation' 'Facebook' 'Freeport-McMoran Cp & Gld' 'Frontier Communications' 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'Monster Beverage' 'The Mosaic Company' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Skyworks Solutions' 'Southwestern Energy' 'Teradata Corp.' 'Vertex Pharmaceuticals Inc' 'Williams Cos.' 'Wynn Resorts Ltd' 'Cimarex Energy' 'Yahoo Inc.'] Cluster 2 ['Apache Corporation' 'Chesapeake Energy' 'Devon Energy Corp.'] Cluster 3 ['Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
dfkm1.groupby(["Cluster", "GICS Sector"])['Security'].count()
| Security | ||
|---|---|---|
| Cluster | GICS Sector | |
| 0 | Consumer Discretionary | 3 |
| Consumer Staples | 1 | |
| Energy | 21 | |
| Health Care | 3 | |
| Industrials | 1 | |
| Information Technology | 8 | |
| Materials | 3 | |
| Telecommunications Services | 1 | |
| 1 | Consumer Discretionary | 36 |
| Consumer Staples | 17 | |
| Energy | 5 | |
| Financials | 46 | |
| Health Care | 35 | |
| Industrials | 52 | |
| Information Technology | 24 | |
| Materials | 17 | |
| Real Estate | 27 | |
| Telecommunications Services | 2 | |
| Utilities | 24 | |
| 2 | Energy | 3 |
| 3 | Consumer Discretionary | 1 |
| Consumer Staples | 1 | |
| Energy | 1 | |
| Financials | 3 | |
| Health Care | 2 | |
| Information Technology | 1 | |
| Telecommunications Services | 2 |
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
# selecting numerical columns
num_col = dfkm1.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=dfkm1, x="Cluster", y=variable)
plt.tight_layout(pad=2.0)
As a reminder: data2 is a copy of the data frame that drops both GICS Sector and GICS Sub Industry. This data frame may lose too much information to be useful but will not suffer from the curse of dimensionality at all
# make a copy of data1 so we don't change it so we can use data1 for hierarchical clustering later
k_means_df2 = data2.copy()
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(data2)
prediction = model.predict(k_means_df2)
distortion = (
sum(np.min(cdist(k_means_df2, model.cluster_centers_, "euclidean"), axis=1))
/ k_means_df1.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.3862098789299604 Number of Clusters: 3 Average Distortion: 2.33620927590848 Number of Clusters: 4 Average Distortion: 2.219050563833442 Number of Clusters: 5 Average Distortion: 2.133404401901685 Number of Clusters: 6 Average Distortion: 2.081503686093715 Number of Clusters: 7 Average Distortion: 2.0045413402786814 Number of Clusters: 8 Average Distortion: 1.9864237824874411 Number of Clusters: 9 Average Distortion: 1.956222103389025 Number of Clusters: 10 Average Distortion: 1.9360473996664198 Number of Clusters: 11 Average Distortion: 1.8615942883461607 Number of Clusters: 12 Average Distortion: 1.8219574388532505 Number of Clusters: 13 Average Distortion: 1.7936924742607907 Number of Clusters: 14 Average Distortion: 1.7567842179093438
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df2) # fit the data to the visualizer
visualizer.show() # finalize and render figure
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
# silhouette scores
silhouette_scores = []
cluster_list = range(2,15)
for k in range(2, 15):
kmeans = KMeans(n_clusters=k, random_state=1).fit(k_means_df2)
predictions = kmeans.fit_predict(k_means_df2)
score = silhouette_score(k_means_df2, predictions)
silhouette_scores.append(score)
print("Number of Clusters:", k, "\tAverage Silhouette Score:", score)
plt.plot(cluster_list, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Average Silhouette Score")
plt.title("Selecting k with the Silhouette Method", fontsize=20)
plt.show()
Number of Clusters: 2 Average Silhouette Score: 0.45335782729503565 Number of Clusters: 3 Average Silhouette Score: 0.40374060030338865 Number of Clusters: 4 Average Silhouette Score: 0.4246430808437099 Number of Clusters: 5 Average Silhouette Score: 0.4381539778147092 Number of Clusters: 6 Average Silhouette Score: 0.40869599703024256 Number of Clusters: 7 Average Silhouette Score: 0.1207450219233897 Number of Clusters: 8 Average Silhouette Score: 0.3693991650696542 Number of Clusters: 9 Average Silhouette Score: 0.35185096182499204 Number of Clusters: 10 Average Silhouette Score: 0.32950073703610283 Number of Clusters: 11 Average Silhouette Score: 0.1486586842527321 Number of Clusters: 12 Average Silhouette Score: 0.15784241071085106 Number of Clusters: 13 Average Silhouette Score: 0.15646997458716602 Number of Clusters: 14 Average Silhouette Score: 0.16253506827999134
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2,15), metric="silhouette", timings=True)
visualizer.fit(k_means_df2) # fit the data to the visualizer
visualizer.show() # finalize and render figure
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
Between the silhouette score and the elbow method, clusters of 2, 5, 6 and 8 seem lik they're worth exploring.
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=2, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=5, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=8, random_state=1))
visualizer.fit(k_means_df2)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 8 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Once again, the highest silhouette scores aren't very high and our model is putting the majority of data points into one cluster. There is very little useful information to be obtained from this model or from K-means on data1. I am going to go with 5 clusters because it has the second highest silhouette score while still explaining a decent amount of the in-cluster variance.
# final kmeans model for data1
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(k_means_df2)
KMeans(n_clusters=5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=5, random_state=1)
# create a copy from original copy
dfkm2 = data.copy() # name comes from data frame kmeans on data 1
# adding kmeans cluster labels to the copy of the original copy and to the scaled data frame data1
dfkm2['Cluster'] = kmeans.labels_
k_means_df2['Cluster'] = kmeans.labels_
# cluster profiling
km_cluster_profile_2 = dfkm2.groupby('Cluster').mean(numeric_only=True)
km_cluster_profile_2["Count of Companies in this Cluster"] = (dfkm2.groupby('Cluster')["Security"].count().values)
km_cluster_profile_2.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count of Companies in this Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||||
| 0 | 72.738269 | 5.179897 | 1.380738 | 34.825455 | 53.138182 | -1.014729e+07 | 1.488642e+09 | 3.636164 | 4.379616e+08 | 23.680917 | -3.395254 | 275 |
| 1 | 65.106668 | -11.888125 | 2.722141 | 44.000000 | 61.400000 | -3.685830e+07 | -2.137169e+09 | -5.560333 | 5.297142e+08 | 113.488924 | 0.905486 | 30 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1.292500e+09 | -1.910650e+10 | -41.815000 | 5.195740e+08 | 60.748608 | 1.565141 | 2 |
| 3 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3.040667e+09 | 1.484844e+10 | 3.435556 | 4.564960e+09 | 15.596051 | -6.354193 | 9 |
| 4 | 211.164720 | 12.456786 | 1.699388 | 30.708333 | 280.250000 | 2.197085e+09 | 2.808601e+09 | 6.818333 | 7.389574e+08 | 37.895420 | 15.682619 | 24 |
# print which companies are in each cluster
for cluster in dfkm2['Cluster'].unique():
print(f'Cluster {cluster}')
print(dfkm2[dfkm2['Cluster'] == cluster]['Security'].unique())
print('')
Cluster 0 ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] Cluster 1 ['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'The Mosaic Company' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] Cluster 4 ['Alliance Data Systems' 'Amgen Inc' 'Bank of America Corp' 'BIOGEN IDEC Inc.' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Gilead Sciences' 'Halliburton Co.' 'Intel Corp.' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] Cluster 2 ['Apache Corporation' 'Chesapeake Energy'] Cluster 3 ['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
dfkm2.groupby(["Cluster", "GICS Sector"])['Security'].count()
| Security | ||
|---|---|---|
| Cluster | GICS Sector | |
| 0 | Consumer Discretionary | 33 |
| Consumer Staples | 17 | |
| Energy | 5 | |
| Financials | 45 | |
| Health Care | 29 | |
| Industrials | 52 | |
| Information Technology | 24 | |
| Materials | 18 | |
| Real Estate | 26 | |
| Telecommunications Services | 2 | |
| Utilities | 24 | |
| 1 | Consumer Discretionary | 1 |
| Energy | 21 | |
| Health Care | 1 | |
| Industrials | 1 | |
| Information Technology | 4 | |
| Materials | 2 | |
| 2 | Energy | 2 |
| 3 | Consumer Discretionary | 1 |
| Consumer Staples | 1 | |
| Energy | 1 | |
| Financials | 3 | |
| Health Care | 1 | |
| Telecommunications Services | 2 | |
| 4 | Consumer Discretionary | 5 |
| Consumer Staples | 1 | |
| Energy | 1 | |
| Financials | 1 | |
| Health Care | 9 | |
| Information Technology | 5 | |
| Real Estate | 1 | |
| Telecommunications Services | 1 |
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
# selecting numerical columns
num_col = dfkm2.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=dfkm2, x="Cluster", y=variable)
plt.tight_layout(pad=2.0)
As a reminder: data1 is a copy of the data frame with both GICS Sector and GICS Sub Industry dummy encoded. The concern here will be curse of dimensionality as we are increasing the number of columns from 13 to 124 even with drop_first=True
hc_df1 = data1.copy()
# distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df1, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df1))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
print(
"The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9192696289462164. Cophenetic correlation for Euclidean distance and complete linkage is 0.5839761524776856. Cophenetic correlation for Euclidean distance and average linkage is 0.9387918818713881. Cophenetic correlation for Euclidean distance and weighted linkage is 0.9017395996624018. Cophenetic correlation for Chebyshev distance and single linkage is 0.8733698250376598. Cophenetic correlation for Chebyshev distance and complete linkage is 0.7629621936822496. Cophenetic correlation for Chebyshev distance and average linkage is 0.9194190244129151. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.8866543829752598. Cophenetic correlation for Mahalanobis distance and single linkage is -0.018111220588679713. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.10942269236725295. Cophenetic correlation for Mahalanobis distance and average linkage is 0.07162884851014278. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.05628335494200424. Cophenetic correlation for Cityblock distance and single linkage is 0.9110913485411501. Cophenetic correlation for Cityblock distance and complete linkage is 0.6986069960985519. Cophenetic correlation for Cityblock distance and average linkage is 0.891363494361917. Cophenetic correlation for Cityblock distance and weighted linkage is 0.6244033214763337. The highest cophenetic correlation is 0.9387918818713881 for Euclidean distance and average linkage.
It looks like Euclidean distance and Average linkage are the best options for data1.
# let's check linkage types for Euclidean only
linkage_types = ["single", "complete", "average", "weighted", "centroid", "median", "ward"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_types:
Z = linkage(hc_df1, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df1))
print(
"Cophenetic correlation for Euclidean distance and {} linkage is {}.".format(
lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "Euclidean"
high_dm_lm[1] = lm
# let's print the linkage method that works best for Euclidean
print(
"The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9192696289462164. Cophenetic correlation for Euclidean distance and complete linkage is 0.5839761524776856. Cophenetic correlation for Euclidean distance and average linkage is 0.9387918818713881. Cophenetic correlation for Euclidean distance and weighted linkage is 0.9017395996624018. Cophenetic correlation for Euclidean distance and centroid linkage is 0.946013261636426. Cophenetic correlation for Euclidean distance and median linkage is 0.914449553156567. Cophenetic correlation for Euclidean distance and ward linkage is 0.6910854135257704. The highest cophenetic correlation is 0.946013261636426 for Euclidean distance and centroid linkage.
We improved the cophentic correlation by about 0.01 by doing this.
# linkage types
linkage_types = ["single", "complete", "average", "centroid", "ward", "weighted", "median"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_types), 1, figsize=(15, 30))
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_types):
Z = linkage(hc_df1, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_df1))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
compare.append([method, coph_corr])
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc1 = pd.DataFrame(compare, columns=compare_cols)
df_cc1 = df_cc1.sort_values(by="Cophenetic Coefficient")
df_cc1
| Linkage | Cophenetic Coefficient | |
|---|---|---|
| 1 | complete | 0.583976 |
| 4 | ward | 0.691085 |
| 5 | weighted | 0.901740 |
| 6 | median | 0.914450 |
| 0 | single | 0.919270 |
| 2 | average | 0.938792 |
| 3 | centroid | 0.946013 |
# create a model using 4 clusters, and average linkage
HCmodel1 = AgglomerativeClustering(n_clusters=5, linkage="ward")
HCmodel1.fit(hc_df1)
# adding the cluster labels to the a copy of the original data frame as well as the scaled data frame
dfhc1 = data.copy()
hc_df1["Cluster"] = HCmodel1.labels_ # the scaled data
dfhc1["Cluster"] = HCmodel1.labels_ # the non-scaled data
Note: Even though Centroid had the highest cophentic coefficient, Centroid cannot be passed as an argument for the linkage parameter in AgglomerativeClustering. We will use the 'Ward' model because even though it has a somewhat lower cophenetic score, it divides the data into the most varied clusters. Every other method puts more than 90% of the data in one cluster unless there are dozens of clusters.
hc_cluster_profile_1 = dfhc1.groupby("Cluster").mean(numeric_only=True)
hc_cluster_profile_1["Count of Companies in this Cluster"] = (
dfhc1.groupby("Cluster")["Security"].count().values)
# let's exampine which companies are in each cluster
for cluster in dfhc1['Cluster'].unique():
print(f'Cluster {cluster}')
print(dfhc1[dfhc1['Cluster'] == cluster]['Security'].unique())
print('')
Cluster 3 ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] Cluster 0 ['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] Cluster 1 ['Alliance Data Systems' 'Amgen Inc' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'Frontier Communications' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Yahoo Inc.'] Cluster 4 ['Allegion' 'Apache Corporation' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark' 'S&P Global, Inc.'] Cluster 2 ['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
There is a moderate imbalance between cluster sizes and the cophentic coefficient for the Ward dendogram was only 0.69, but every other hierarchical clustering method had a single cluster containing more than 90% of the data points. This may be due to:
# let's display the hc_cluster_profile_1
hc_cluster_profile_1.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count of Companies in this Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||||
| 0 | 66.875863 | -11.513243 | 2.684321 | 44.965517 | 67.517241 | 270355000.000000 | -2279596724.137931 | -5.930345 | 557924857.946207 | 119.832148 | 1.185239 | 29 |
| 1 | 272.166454 | 9.664338 | 1.587885 | 23.375000 | 337.187500 | 547402937.500000 | 1135918937.500000 | 7.585000 | 578873966.226250 | 40.700126 | 19.675682 | 16 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.697884 | 5.306429 | 1.409486 | 25.631769 | 55.718412 | 425447.653430 | 1551832010.830325 | 3.729477 | 437878516.249134 | 23.459500 | -2.881193 | 277 |
| 4 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
# let's see the minimums as well
hc_cluster_profile_1.style.highlight_min(color="red", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count of Companies in this Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||||
| 0 | 66.875863 | -11.513243 | 2.684321 | 44.965517 | 67.517241 | 270355000.000000 | -2279596724.137931 | -5.930345 | 557924857.946207 | 119.832148 | 1.185239 | 29 |
| 1 | 272.166454 | 9.664338 | 1.587885 | 23.375000 | 337.187500 | 547402937.500000 | 1135918937.500000 | 7.585000 | 578873966.226250 | 40.700126 | 19.675682 | 16 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.697884 | 5.306429 | 1.409486 | 25.631769 | 55.718412 | 425447.653430 | 1551832010.830325 | 3.729477 | 437878516.249134 | 23.459500 | -2.881193 | 277 |
| 4 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
plt.figure(figsize=(20, 35))
plt.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)
# create a list of the numerical columns
num_cols = dfhc1.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_cols):
plt.subplot(5, 3, i + 1)
sns.boxplot(data=dfhc1, x="Cluster", y=variable)
plt.tight_layout(pad=2.0)
As a reminder: data2 is a copy of the data frame that drops both GICS Sector and GICS Sub Industry. This data frame may lose too much information to be useful but will not suffer from the curse of dimensionality at all
hc_df2 = data2.copy()
# distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df2, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df2))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
print(
"The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717. Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242. Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367. Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428. Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574. Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818. Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667. Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281. The highest cophenetic correlation is 0.9422540609560814 for Euclidean distance and average linkage.
It appears the best distance type and linkage method is Euclidean distance with Average linkage
# let's check linkage types for Euclidean only
linkage_types = ["single", "complete", "average", "weighted", "centroid", "median", "ward"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_types:
Z = linkage(hc_df2, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df2))
print(
"Cophenetic correlation for Euclidean distance and {} linkage is {}.".format(
lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "Euclidean"
high_dm_lm[1] = lm
# let's print the linkage method that works best for Euclidean
print(
"The highest cophenetic correlation is {} for {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Euclidean distance and centroid linkage is 0.9314012446828154. Cophenetic correlation for Euclidean distance and median linkage is 0.9198690668829905. Cophenetic correlation for Euclidean distance and ward linkage is 0.7101180299865353. The highest cophenetic correlation is 0.9422540609560814 for Euclidean distance and average linkage.
We found the same results.
# linkage types
linkage_types = ["single", "complete", "average", "centroid", "ward", "weighted", "median"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_types), 1, figsize=(15, 30))
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_types):
Z = linkage(hc_df2, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_df2))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
compare.append([method, coph_corr])
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc2 = pd.DataFrame(compare, columns=compare_cols)
df_cc2 = df_cc2.sort_values(by="Cophenetic Coefficient")
df_cc2
| Linkage | Cophenetic Coefficient | |
|---|---|---|
| 4 | ward | 0.710118 |
| 1 | complete | 0.787328 |
| 5 | weighted | 0.869378 |
| 6 | median | 0.919869 |
| 0 | single | 0.923227 |
| 3 | centroid | 0.931401 |
| 2 | average | 0.942254 |
It appears again that despite dropping the extra dummy variables for GICS Sector and GICS Sub Industry, every clustering method except Ward puts more than 90% of the data into one cluster. However, compared to data1, we have a higher cophenetic coefficient so we are outperforming data1 hierarchical clustering. We will use Ward again because it gives the best variability of clusters despite only having a cophenetic coefficient of 0.71.
# create a model using 4 clusters, and average linkage
HCmodel2 = AgglomerativeClustering(n_clusters=5, linkage="ward")
HCmodel2.fit(hc_df2)
# adding the cluster labels to the a copy of the original data frame as well as the scaled data frame
dfhc2 = data.copy()
hc_df2["Cluster"] = HCmodel2.labels_ # the scaled data
dfhc2["Cluster"] = HCmodel2.labels_ # the non-scaled data
hc_cluster_profile_2 = dfhc2.groupby("Cluster").mean(numeric_only=True)
hc_cluster_profile_2["Count of Companies in this Cluster"] = (
dfhc2.groupby("Cluster")["Security"].count().values)
# let's exampine which companies are in each cluster
for cluster in dfhc2['Cluster'].unique():
print(f'Cluster {cluster}')
print(dfhc2[dfhc2['Cluster'] == cluster]['Security'].unique())
print('')
Cluster 3 ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] Cluster 0 ['Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc' 'Amazon.com Inc' 'Chipotle Mexican Grill' 'Equinix' 'Facebook' 'Frontier Communications' 'Intuitive Surgical Inc.' 'Monster Beverage' 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'Waters Corporation' 'Yahoo Inc.'] Cluster 1 ['Allegion' 'Apache Corporation' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark' 'S&P Global, Inc.'] Cluster 4 ['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] Cluster 2 ['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
There is a moderate imbalance between cluster sizes and the cophentic coefficient for the Ward dendogram was only 0.71, but every other hierarchical clustering method had a single cluster containing more than 90% of the data points. This may be due to:
Unlike data1, we know this is not due to curse of dimensionality because we did not make dummy variables for GICS Sector or GICS Sub Industry for data2.
# let's display the hc_cluster_profile_2
hc_cluster_profile_2.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count of Companies in this Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||||
| 0 | 326.198218 | 10.563242 | 1.642560 | 14.400000 | 309.466667 | 288850666.666667 | 864498533.333333 | 7.785333 | 544900261.301333 | 113.095334 | 19.142151 | 15 |
| 1 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 285 |
| 4 | 36.440455 | -16.073408 | 2.832884 | 57.500000 | 42.409091 | -472834090.909091 | -3161045227.272727 | -8.005000 | 514367806.201818 | 85.555682 | 0.836839 | 22 |
# let's see the minimums as well
hc_cluster_profile_2.style.highlight_min(color="red", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Count of Companies in this Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | ||||||||||||
| 0 | 326.198218 | 10.563242 | 1.642560 | 14.400000 | 309.466667 | 288850666.666667 | 864498533.333333 | 7.785333 | 544900261.301333 | 113.095334 | 19.142151 | 15 |
| 1 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 285 |
| 4 | 36.440455 | -16.073408 | 2.832884 | 57.500000 | 42.409091 | -472834090.909091 | -3161045227.272727 | -8.005000 | 514367806.201818 | 85.555682 | 0.836839 | 22 |
plt.figure(figsize=(20, 35))
plt.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)
# create a list of the numerical columns
num_cols = dfhc2.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_cols):
plt.subplot(5, 3, i + 1)
sns.boxplot(data=dfhc2, x="Cluster", y=variable)
plt.tight_layout(pad=2.0)
If I was the one investing, I would want to avoid the clusters with high variability for safer investments. Because I've build four models and for each model the largest cluster was least weighted by outliers, I would be inclined to pick the companies that appeared in the largest cluster for all four models. This would protect my investments from being invested in companies that were clusters according to poor performance. While some of the small cluster performed well in one area or another, every small cluster also underperformed in some area. Below I will do some code to figure out which companies repeatedly appeared in the largest, safest cluster.
# Get the companies that meet the criteria from each data frame
securities_km1 = set(dfkm1[dfkm1['Cluster'] == 1]['Security']) # for each model I went back and checked which cluster was largest
securities_km2 = set(dfkm2[dfkm2['Cluster'] == 0]['Security'])
securities_hc1 = set(dfhc1[dfhc1['Cluster'] == 3]['Security'])
securities_hc2 = set(dfhc2[dfhc2['Cluster'] == 3]['Security'])
# Find the intersection of the sets
result_securities = securities_km1.intersection(securities_km2, securities_hc1, securities_hc2)
# Convert the result to a list (if needed)
result_securities_list = list(result_securities)
# Display the result
print(result_securities_list)
# Check how many securities are in result_securities_list
print(len(result_securities_list))
['Stryker Corp.', 'Vornado Realty Trust', 'Capital One Financial', 'Mylan N.V.', 'Fiserv Inc', "Honeywell Int'l Inc.", 'Western Union Co', 'Zoetis', '3M Company', 'Mattel Inc.', 'Aon plc', 'Lockheed Martin Corp.', 'FirstEnergy Corp', 'CME Group Inc.', 'Procter & Gamble', 'Illinois Tool Works', 'Ameriprise Financial', 'HCA Holdings', 'Lennar Corp.', 'Boston Scientific', 'Juniper Networks', 'American Water Works Company Inc', "Marriott Int'l.", 'Xcel Energy Inc', 'Zions Bancorp', 'L-3 Communications Holdings', 'Adobe Systems Inc', 'Anthem Inc.', 'Fastenal Co', 'Valero Energy', 'Arthur J. Gallagher & Co.', 'Unum Group', 'Amphenol Corp', 'Consolidated Edison', 'American Express Co', 'NASDAQ OMX Group', 'Dr Pepper Snapple Group', 'Bristol-Myers Squibb', 'Iron Mountain Incorporated', 'AbbVie', 'Realty Income Corporation', 'Southwest Airlines', 'Ball Corp', 'Molson Coors Brewing Company', 'Citrix Systems', 'SCANA Corp', 'Humana Inc.', "People's United Financial", 'Torchmark Corp.', 'Carnival Corp.', 'Allstate Corp', "Expeditors Int'l", 'Alaska Air Group Inc', 'American Electric Power', 'Marsh & McLennan', 'Leucadia National Corp.', 'Eversource Energy', 'Eaton Corporation', 'Pulte Homes Inc.', 'Garmin Ltd.', 'Entergy Corp.', 'Goodyear Tire & Rubber', 'PPG Industries', 'Varian Medical Systems', 'Baxter International Inc.', 'FLIR Systems', 'United Health Group Inc.', 'Southern Co.', "Edison Int'l", 'AFLAC Inc', 'Omnicom Group', 'Tractor Supply Company', 'Principal Financial Group', 'Martin Marietta Materials', 'UDR Inc', 'Du Pont (E.I.)', 'Mettler Toledo', 'Lilly (Eli) & Co.', 'United Continental Holdings', 'Synchrony Financial', 'Vulcan Materials', 'Expedia Inc.', 'Ameren Corp', 'Universal Health Services, Inc.', 'Exelon Corp.', 'Intl Flavors & Fragrances', 'Roper Industries', 'DaVita Inc.', 'Mondelez International', 'Republic Services Inc', 'J. B. Hunt Transport Services', 'CSX Corp.', 'CenterPoint Energy', 'Xylem Inc.', 'Delphi Automotive', 'PACCAR Inc.', 'Progressive Corp.', 'Masco Corp.', 'Mohawk Industries', 'IDEXX Laboratories', 'eBay Inc.', 'Mastercard Inc.', 'CMS Energy', 'Mead Johnson', 'Wyndham Worldwide', 'Applied Materials Inc', 'CF Industries Holdings Inc', 'SunTrust Banks', 'Alliant Energy Corp', 'Duke Energy', 'Welltower Inc.', 'Harley-Davidson', 'Waste Management Inc.', 'Cognizant Technology Solutions', 'State Street Corp.', 'Centene Corporation', 'Host Hotels & Resorts', 'Dominion Resources', 'Charles Schwab Corporation', 'International Paper', 'United Technologies', 'Simon Property Group Inc', 'Huntington Bancshares', 'Deere & Co.', 'Verisign Inc.', 'Altria Group Inc', 'NextEra Energy', 'Equity Residential', 'American Tower Corp A', 'Archer-Daniels-Midland Co', 'AMETEK Inc', 'Albemarle Corp', 'General Dynamics', 'Hasbro Inc.', 'McCormick & Co.', 'PNC Financial Services', 'The Hershey Company', 'Fortune Brands Home & Security', 'United Parcel Service', 'Praxair Inc.', 'E*Trade', 'HP Inc.', 'Flowserve Corporation', 'Kimco Realty', 'Apartment Investment & Mgmt', 'Church & Dwight', 'Ventas Inc', "Moody's Corp", 'Hartford Financial Svc.Gp.', 'American Airlines Group', 'The Walt Disney Company', 'The Bank of New York Mellon Corp.', 'Navient', 'Affiliated Managers Group Inc', 'Dentsply Sirona', 'Discover Financial Services', 'Genuine Parts', 'Macerich', 'Pitney-Bowes', 'BorgWarner', 'Philip Morris International', 'Scripps Networks Interactive Inc.', 'SL Green Realty', 'Nielsen Holdings', 'Public Serv. Enterprise Inc.', 'Sempra Energy', 'Quest Diagnostics', 'General Growth Properties Inc.', 'Stericycle Inc', 'Cincinnati Financial', 'Federal Realty Investment Trust', 'Interpublic Group', 'Tegna, Inc.', 'Northern Trust Corp.', 'MetLife Inc.', 'Total System Services', 'Newell Brands', 'Phillips 66', 'XL Capital', 'Leggett & Platt', 'Boeing Company', 'Comerica Inc.', 'Eastman Chemical', 'AvalonBay Communities, Inc.', 'Grainger (W.W.) Inc.', 'Verisk Analytics', 'Boston Properties', 'Assurant Inc', 'Under Armour', 'Whirlpool Corp.', 'Activision Blizzard', 'Norfolk Southern Corp.', 'Cummins Inc.', 'Level 3 Communications', 'Crown Castle International Corp.', 'LyondellBasell', 'Marathon Petroleum', 'CBRE Group', 'PayPal', 'General Motors', 'Essex Property Trust, Inc.', 'PG&E Corp.', 'Pentair Ltd.', 'PepsiCo Inc.', 'Delta Air Lines', 'Pinnacle West Capital', 'Robert Half International', 'HCP Inc.', 'Nucor Corp.', 'Fidelity National Information Services', 'Ryder System', 'Zimmer Biomet Holdings', 'C. H. Robinson Worldwide', 'American International Group, Inc.', 'Laboratory Corp. of America Holding', 'Corning Inc.', 'Caterpillar Inc.', 'Discovery Communications-C', 'Dun & Bradstreet', 'Union Pacific', 'Thermo Fisher Scientific', 'CenturyLink Inc', 'Royal Caribbean Cruises Ltd', 'Digital Realty Trust', 'Merck & Co.', 'Invesco Ltd.', 'Texas Instruments', 'FMC Corporation', "O'Reilly Automotive", 'Dover Corp.', 'Weyerhaeuser Corp.', 'Equifax Inc.', 'Henry Schein', 'Chevron Corp.', 'Xerox Corp.', 'Bard (C.R.) Inc.', 'Sealed Air', 'Kansas City Southern', 'Prudential Financial', 'Jacobs Engineering Group', 'M&T Bank Corp.', 'BB&T Corporation', 'Citizens Financial Group', 'CIGNA Corp.', 'Hormel Foods Corp.', 'Tesoro Petroleum Co.', 'Yum! Brands Inc', 'AutoNation Inc', 'Danaher Corp.', 'Ecolab Inc.', 'Extra Space Storage', 'Arconic Inc', 'The Travelers Companies Inc.', 'Chubb Limited', 'Abbott Laboratories', 'Sherwin-Williams', 'Wec Energy Group Inc', 'LKQ Corporation', 'Mid-America Apartments', 'The Cooper Companies', 'CVS Health', 'Fluor Corp.', 'PPL Corp.', 'International Business Machines', 'Discovery Communications-A', 'Broadcom', 'Tyson Foods'] 267
# create a data frame with only companies that appear on the result_securities_list
df_result = data[data['Security'].isin(result_securities_list)]
# check the list for outliers below the lower quartile for any numeric column in the data frame
# create list of categories that you wouldn't want to invest if the company was a low outlier
categories = ['Price Change', 'ROE', 'Cash Ratio', 'Net Cash Flow', 'Net Income', 'Earnings Per Share', "P/E Ratio", "P/B Ratio"]
# Calculate the 25th percentile for each category
percentiles = {cat: df_result[cat].quantile(0.25) for cat in categories}
# Filter the DataFrame to exclude companies below the 25th percentile in any category
for cat, threshold in percentiles.items():
df_result = df_result[df_result[cat] >= threshold]
# reindex df_result
df_result = df_result.reset_index(drop=True)
# Display the resulting DataFrame
print(df_result['Security'])
0 Abbott Laboratories 1 AMETEK Inc 2 Amphenol Corp 3 Broadcom 4 Cincinnati Financial 5 Deere & Co. 6 The Walt Disney Company 7 General Growth Properties Inc. 8 Hormel Foods Corp. 9 Invesco Ltd. 10 Mastercard Inc. 11 Macerich 12 Moody's Corp 13 Mead Johnson 14 3M Company 15 Norfolk Southern Corp. 16 PepsiCo Inc. 17 Procter & Gamble 18 Progressive Corp. 19 Philip Morris International 20 PPG Industries 21 Simon Property Group Inc 22 Tyson Foods 23 Vornado Realty Trust Name: Security, dtype: object
This is a list of my final actionable recommendations. I would invest in the above companies. These companies were in the safest cluster and have been filtered for companies that performed poorly in any category you would not want a company to perform poorly in. There may be some companies that perform better in some areas, but they companies all perform relatively well in all areas. Closer scrutiny is required as I am not a financial advisor.